Week 11.5 - Data, Languages and African Model-Building

🎯 What We'll Cover

Sub-Lesson 11.4 closed on the compute layer. We turn now to the other three layers in the stack that African researchers have actually built over the last few years: the data they govern, the models they train, and the benchmarks they use to measure progress. This is the most concretely positive part of the African AI story. The infrastructure conversation in 11.4 was substantially aspirational; the data-and-models conversation in 11.5 is substantially shipping.

This sub-lesson is structured around three pillars and one survey paper. The survey is Alabi, Hedderich, Adelani & Klakow's 2025 Charting the Landscape of African NLP, which mapped 884 papers over five years and is the most comprehensive recent overview of the field. The three pillars are: a global Indigenous data-sovereignty arc that runs from Te Hiku Media's Kaitiakitanga License through the CARE Principles to the Esethu Framework; an African foundation-model inventory that is genuinely substantial, structured by family and led by InkubaLM (Lelapa) and MzansiLM (UCT); and an African benchmark stack with new 2026 additions worth foregrounding.

We close on the pedagogically most useful split in the current literature — sovereign/frontier-aspirant projects (Awarri's N-ATLAS in Nigeria) versus resource-efficient/pragmatic projects (Lelapa's InkubaLM family) — and a section on where the gaps are: the thesis-shaped opportunities that postgraduate researchers in this course are well placed to close.

📍 The Map: Charting the Landscape of African NLP

If you read only one survey paper before going further into the African NLP literature, make it this one.

📚 Alabi, Hedderich, Adelani & Klakow (2025) — Charting the Landscape of African NLP

Jesujoba O. Alabi, Michael A. Hedderich, David Ifeoluwa Adelani & Dietrich Klakow. Charting the Landscape of African NLP. EMNLP 2025 main, pp. 27807–27841. arXiv:2505.21315.

The paper systematically surveys 884 African-NLP papers over a five-year window and produces a map that is genuinely useful for navigation. It documents which languages have been worked on (and how unevenly), which tasks dominate the literature (MT, NER, sentiment, and ASR are the largest), which institutions are most active, and where the cross-cultural and cross-linguistic blind spots lie. If you are starting a project on African NLP and you do not know the lay of the land, this is where to begin.

A complement worth noting: Belay, Azime, Adelani et al.'s The Rise of AfricaNLP (arXiv:2509.25477, September 2025, updated April 2026) provides the bibliometric companion — 2,200 papers analysed for community impact and contributor patterns.

📊 The single most useful number from the recent literature

A June 2025 quantitative survey, The State of Large Language Models for African Languages: Progress and Challenges (Hussen, Sewunetie, Ayele, Imam, Muhammad & Yimam, arXiv:2506.02280), measured what current LLM families actually cover. The headline finding: across six large LLMs, eight small LMs, and six specialised small LMs, just 42 African languages receive meaningful support. Africa has roughly two thousand languages, so about 98% of them remain unsupported by current foundation-model infrastructure. The script picture is similar: current LLM tokenisers handle Latin, Arabic, and Ge'ez but not the roughly 20 active African scripts (Tifinagh, N'Ko, Vai, Adlam and others).

This is the single most useful framing number for the whole African-model conversation. The space is vastly under-served; the work that exists is significant precisely because of how much more there is to do.
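The script point can be checked for any piece of text. The sketch below is a rough illustration, not a tokeniser audit: it labels each letter by the first word of its Unicode character name (an assumption that holds for the scripts named above but not for every Unicode block) and reports any script outside the three the survey says current tokenisers handle. The function names are hypothetical.

```python
import unicodedata
from collections import Counter

# Scripts the survey reports as handled by current LLM tokenisers.
# Ge'ez characters carry the name prefix ETHIOPIC in Unicode.
HANDLED = {"LATIN", "ARABIC", "ETHIOPIC"}


def script_counts(text: str) -> Counter:
    """Count letters per script, using the first word of the Unicode name as a proxy."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, spaces, combining marks
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts


def unhandled_scripts(text: str) -> list:
    """Return the sorted script labels in `text` that fall outside HANDLED."""
    return sorted(set(script_counts(text)) - HANDLED)


# One letter each from Tifinagh, N'Ko, Vai and Adlam, written as escapes.
sample = "\u2d30 \u07ca \ua500 \U0001e900"
print(unhandled_scripts(sample))
```

A non-empty result is only a prompt to look closer: whether a particular tokeniser copes with a script still has to be tested against that tokeniser's own vocabulary.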

🌍 The Global Indigenous Data-Sovereignty Arc

Before we look at the models, we look at the framework under which African researchers are choosing to release the data the models are built on. This conversation does not start with African scholarship; it sits inside a longer, global Indigenous data-sovereignty tradition. Naming that lineage matters, because the African work is part of a wider movement that has been theorising community-grounded data governance for considerably longer than the current AI moment has been running.

Te Hiku Media & the Kaitiakitanga License

Te Hiku Media (tehiku.nz), an iwi radio station and media hub representing five Far-North Māori iwi (Ngāti Kuri, Te Aupōuri, Ngāi Takoto, Te Rarawa, Ngāti Kahu), was established as Te Hiku O Te Ika at Awanui in December 1990. The organisation built the foundational ASR work for te reo Māori, reporting around 92% accuracy for monolingual Māori speech. To govern the speech data the community contributed, Te Hiku developed the Kaitiakitanga License, which treats the organisation as kaitiaki — guardian, not owner — of the data, and explicitly prohibits uses that would surveil, discriminate against, or harm Māori people. This is the most-cited Indigenous-data-license precedent in the African NLP literature.

Jones, K. & Mahelona, K. Data Sovereignty and the Kaitiakitanga License (Te Hiku Media, 2022, updated 2023). tehiku.nz.

The CARE Principles

The CARE Principles for Indigenous Data Governance (Carroll, Garba, Figueroa-Rodríguez, Holbrook, Lovett, Materechera, Parsons, Raseroka, Rodriguez-Lonebear, Rowe, Sara, Walker, Anderson & Hudson, 2020) generalise the Te Hiku-style position into a four-principle framework: Collective Benefit, Authority to Control, Responsibility, and Ethics. The principles were drafted at International Data Week in Gaborone in November 2018 by the RDA International Indigenous Data Sovereignty Interest Group. As of 2026, CARE has been adopted by the Research Data Alliance, the Global Indigenous Data Alliance (GIDA), the Australian Research Data Commons, and the US NIH's genomic-data policy.

Carroll, S. R. et al. (2020). Data Science Journal 19(1): 43. DOI 10.5334/dsj-2020-043. GIDA: gida-global.org.

The Esethu Framework

The Esethu Framework (Rajab, Aremu, Chimoto, Dunbar, Morrissey, Thior, Potgieter, Ojo, Tonja, Chetty, Nekoto, Moiloa, Abbott, Marivate & Rosman, 2025) is the African strand's most developed published proposal. It introduces a community-centric Esethu License under which commercial users, especially non-African ones, pay a fee that is reinvested into dataset expansion via local partners. The proof-of-concept dataset is the Vuk'uzenzele isiXhosa Speech Dataset (ViXSD). The author team is drawn from Wits, Lelapa AI, and Masakhane, and the paper explicitly cites Kaitiakitanga as a precedent.

Rajab, J. et al. (2025). arXiv:2502.15916; ACL 2025.

The wider tradition

Two more strands belong in the global arc. OCAP (Ownership, Control, Access, Possession), developed by the First Nations Information Governance Centre in Canada since 1998, is the oldest of the institutional frameworks. Sápmi data governance, as set out by the Sámi Council and discussed at the first Sámi Research Data Governance conference in Tromsø in January 2023, represents a parallel European-Indigenous tradition. These traditions are not interchangeable, but reading them alongside the African strand makes clear that the relational conception of data governance is genuinely global.

OCAP: FNIGC. Eriksen et al. (2024), Nordic Journal of Library and Information Studies.

🧘 The lineage in one sentence

Kaitiakitanga (Te Hiku, Māori, since the early 2010s in the AI context) → CARE Principles (drafted in Gaborone in 2018, published 2020) → Esethu Framework (Rajab et al., ACL 2025). The African work is the most recent expression of a global Indigenous tradition that explicitly traces its own intellectual lineage back through Māori and First Nations scholarship.

For policy context at the African continental level: the AU Data Policy Framework (adopted February 2022, published 28 July 2022) sets the stated continental orientation toward transparency, accountability, inclusion, and equity, aimed at an African Digital Single Market by 2030. The AU Continental AI Strategy (July 2024) explicitly invokes data sovereignty across its five focus areas. Both documents are real and worth reading; both, as we noted in 11.4, are policy positions rather than enforced regulatory mechanisms.

⚠️ An honest gap to flag

There is, as of May 2026, no peer-reviewed African critique-and-adaptation of the CARE Principles: a paper arguing what would need to change for CARE to be made specifically African rather than imported with adjustments. Abeba Birhane and Rediet Abebe's work orbits this question; the AfricaNLP licensing conversation (Marivate, Adelani, Rajab et al.) is making constructive moves in practice. But the central critical paper has not been written yet. We return to this in the gaps section at the end of the sub-lesson, because it is one of the most concrete thesis opportunities the literature currently offers.

A related practical gap: most African datasets released in 2024 and 2025 still go up under CC-BY or research-only licences, not under Kaitiakitanga- or Esethu-style community licences. The rhetorical commitment to community licensing is genuinely there; the operational uptake is not yet matching it. ViXSD is so far the proof of concept, not the beginning of a wave.

🤝 Community Infrastructure: Masakhane, Lanfrica, AfricaNLP, Indaba

Before the foundation models, before the benchmarks, before the licensed datasets — there is the community infrastructure that makes everything else possible. Four organisations or venues are doing the disproportionate share of the work.

Masakhane

Founded around 2019 out of an ICLR Africa workshop, Masakhane is the grassroots research community that has anchored modern African NLP. The flagship projects are MasakhaNER (named-entity recognition across African languages), MAFAND-MT / LAFAND-MT (machine translation), MasakhaNEWS (16 languages, news topic classification, arXiv:2304.09972), and MasakhaPOS (20 languages, part-of-speech tagging, arXiv:2305.13989). Masakhane's public-facing website at masakhane.io displays older figures, but the organisation is very much active: the GitHub and HuggingFace organisations (huggingface.co/masakhane) show dataset and code updates well into 2025, and Masakhane members account for much of the authorship of InkubaLM, Esethu, and most of the AfricaNLP 2025 proceedings.

Lanfrica

Founded by Chris C. Emezue and Bonaventure F. P. Dossou, Lanfrica (lanfrica.com) is the pan-African resource catalogue that indexes datasets, papers, models, and projects across 2,199 African languages (including extinct languages). In 2025 Lanfrica ran the NaijaVoices Language Heritage Micro-Grants programme, supporting six community-led Nigerian-language projects, and Emezue presented “Data Farming and the QRUE Frameworks” at LT4ALL 2025. If you want to know whether a resource exists for a specific African language, Lanfrica is the right first stop.

AfricaNLP 2025 (ACL Vienna)

The Sixth AfricaNLP Workshop, co-located with ACL 2025 in Vienna on 31 July 2025, was the first archival edition of the workshop — 28 archival and 7 non-archival papers under the theme “Multilingual and Multicultural-aware LLMs”. Editors: Constantine Lignos (Brandeis), Idris Abdulmumin, and David Ifeoluwa Adelani. Proceedings: aclanthology.org/2025.africanlp-1.0/. The archival status is a small but consequential change: AfricaNLP papers now carry the full bibliographic weight of an ACL workshop proceedings.

Deep Learning Indaba 2025

The seventh Deep Learning Indaba was held in Kigali, Rwanda, 17–22 August 2025, hosted by the University of Rwanda under the theme Urunana: Hand in Hand for AI in Africa. It drew over 1,000 participants and ran 12 workshops alongside the by-now-standard mix of mentoring and research showcases, and it remains the African AI community's most important annual gathering. Indaba does not produce an archival proceedings volume; the research output sits on OpenReview and at the contributing labs.

deeplearningindaba.com/2025

🧠 The African Foundation-Model Inventory

What follows is a verified inventory of the foundation models — encoder, from-scratch decoder, adapted decoder, and named-language — that are actually shipping for African languages as of May 2026. Two centrepieces matter most for this course: InkubaLM (the first sovereign African small language model, from Lelapa AI) and MzansiLM (the UCT-built decoder covering all eleven official South African written languages).

🌏 Encoder-family foundations

AfriBERTa (Ogueji, Zhu, Lin, Waterloo, MRL 2021). The early demonstration that high-quality language modelling for low-resource African languages is possible with less than 1 GB of text. 11 African languages.
AfroXLMR (Alabi et al., arXiv:2204.06487, 2022). Multilingual adaptive fine-tuning (MAFT) of XLM-R for African languages. Updated variant AfroXLMR-Social (arXiv:2503.18247, March 2025) by Belay, Azime, Adelani et al.
AfroLM (Dossou, Tonja, Yousuf, Osei et al., arXiv:2211.03263, SustainNLP 2022). 23 African languages, active-learning-based.
Serengeti (Adebara, Elmadany, Abdul-Mageed & Inciarte, UBC NLP, arXiv:2212.10785, Findings of ACL 2023). Massively multilingual: 517 African languages and varieties. The widest-coverage encoder model published to date.

🪂 From-scratch decoder models — the African sovereign small-LM lineage

InkubaLM (Lelapa AI; arXiv:2408.17024, August 2024). Authors: Atnafu Lambebo Tonja, Bonaventure F. P. Dossou, Jessica Ojo, Jenalea Rajab, Fadel Thior, Eric Peter Wairagala, Anuoluwapo Aremu, Pelonomi Moiloa, Jade Abbott, Vukosi Marivate, Benjamin Rosman. 0.4B parameters (422M), trained on 2.4B tokens of which 1.9B are African-language tokens. Languages: isiZulu, Yoruba, Swahili, isiXhosa, Hausa, English, French. CC BY-NC 4.0. Model card: huggingface.co/lelapa/InkubaLM-0.4B. The first published from-scratch African sovereign small language model and a serious benchmark for what is achievable on accessible compute.

MzansiLM (UCT NLP group; arXiv:2603.20732, 21 March 2026; accepted at LREC 2026 in Mallorca). Anri Lombard (UCT master's researcher in computer science) led the work with Drs Francois Meyer and Jan Buys, alongside Simbarashe Mawere, Temi Aina, Ethan Wolff, Sbonelo Gumede, and Elan Novick. 125M-parameter decoder-only model trained from scratch alongside the accompanying MzansiText corpus. Covers all eleven official South African written languages: Sepedi, Sesotho, Setswana, siSwati, Tshivenda, Xitsonga, Afrikaans, English, isiNdebele, isiXhosa, and isiZulu. Reports 20.65 BLEU on isiXhosa data-to-text generation. This is the UCT contribution to the African sovereign-LM lineage and is exactly the kind of work this course's students may go on to do.
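To put the InkubaLM figures above in proportion, the short calculation below derives two ratios from them: the share of training tokens that are African-language tokens, and the number of training tokens per parameter. It uses only the numbers quoted in this section; the variable names are illustrative.

```python
# Figures as quoted above for InkubaLM-0.4B.
PARAMS = 422_000_000
TOTAL_TOKENS = 2_400_000_000
AFRICAN_TOKENS = 1_900_000_000

# Fraction of the training corpus that is African-language text.
african_share = AFRICAN_TOKENS / TOTAL_TOKENS

# Training tokens seen per model parameter.
tokens_per_param = TOTAL_TOKENS / PARAMS

print(f"African-language share of training tokens: {african_share:.1%}")
print(f"Training tokens per parameter: {tokens_per_param:.1f}")
```

On these figures roughly 79% of the training tokens are African-language tokens, and the model saw between five and six tokens per parameter.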

🚫 Adapted-from-Llama decoder models

AfroLlama_V1 (Jacaranda Health, HF model card). 8B parameters, fine-tuned for Swahili, Xhosa, Zulu, Yoruba, Hausa, and English. Jacaranda's companion model UlizaLlama (7B, Llama-2-based, 321M Swahili tokens) was selected into the 2025 AI for Global Development Accelerator.
Lugha-Llama (Buzaaba, Wettig, Adelani & Fellbaum, Princeton, arXiv:2504.06536, April 2025). Mixed-data adaptation of Llama for African languages. Reports state-of-the-art results on IrokoBench and a 10-point improvement on AfriQA.
Toucan (Elmadany, Adebara, Abdul-Mageed, UBC, arXiv:2407.04796, July 2024). Many-to-many translation for 150 African language pairs. Not technically Llama-based but sits in the same adapted-model family.

📣 Named-language models

A handful of African languages now have dedicated foundation-model resources. The Amharic ecosystem is the most developed.

KinyaBERT (Nzeyimana & Niyongabo Rubungo, arXiv:2203.08459, ACL 2022). Encoder model with explicit morphological analysis for Kinyarwanda.
Amharic-LLaMA / Amharic-LLaVA (Andersland, arXiv:2403.06354, March 2024). The first published Amharic-adapted LLaMA with multimodal capability.
Walia-LLM (arXiv:2402.08015, February 2024). Enhanced Amharic-adapted LLaMA.
SwahBERT (2022). Swahili encoder model.

The dataset substrate worth naming: WURA (Oladipo et al., EMNLP 2023) is a high-quality multilingual pre-training corpus covering 16 African languages, and underpins the AfriTeVa model family.

🎯 The African Benchmark Stack

Benchmarks are how a field measures progress. The African benchmark stack has expanded substantially in 2024 and 2025, with several important 2026 additions. What follows is the verified inventory you can actually use to evaluate African-language work.

Benchmark	Coverage	Venue / arXiv
AfroBench	64 languages, 15 tasks (NLU, generation, knowledge/QA, math), 22 datasets	Ojo et al. ACL 2025 Findings; arXiv:2311.07978
IrokoBench	17 typologically diverse languages; AfriXNLI (NLI), AfriMGSM (grade-school math), AfriMMLU (multi-choice knowledge)	Adelani et al. NAACL 2025 main; arXiv:2406.03368
AfriSenti	14 sentiment datasets, 14 African languages, >110,000 tweets	Muhammad et al. EMNLP 2023 + SemEval-2023 Task 12; arXiv:2302.08956
AfriHate	Hate-speech, 15 languages (Algerian/Moroccan Arabic, Amharic, Igbo, Kinyarwanda, Hausa, Nigerian Pidgin, Oromo, Somali, Swahili, Tigrinya, Twi, Xhosa, Yoruba, Zulu)	Muhammad et al. NAACL 2025 long; arXiv:2501.08284
AfriQA	Cross-lingual open-retrieval QA, 10 African languages, ~12,000 examples	Ogundepo, Gwadabe et al. EMNLP 2023 Findings; arXiv:2305.06897
AfriSpeech-200	200 hours Pan-African accented English ASR, 67,577 clips, 2,463 speakers, 120 accents, 13 countries; clinical + general	Olatunji et al. TACL 2023, 11:1669–1685
African ASR systematic review	PRISMA review: 2,062 → 71 studies; 74 datasets, 111 languages, ~11,206 hours; <15% reproducible	Imam et al. October 2025; arXiv:2510.01145
African ASR benchmarking	Whisper, XLS-R, MMS, W2v-BERT on 13 languages at 1–400 hour scales	Nahabwe et al. November 2025; arXiv:2512.10968
AfriMTEB / AfriE5	Text-embedding benchmark: 59 African languages, 14 tasks, 38 datasets. AfriE5 (contrastive-distilled adaptation) outperforms Gemini-Embeddings on this benchmark.	Uemura, Zhang, Adelani EACL 2026 main; arXiv:2510.23896
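
The table can also be kept as a small lookup structure when you are deciding what to evaluate against. The sketch below encodes the benchmark rows and filters them by task and language count. The task tags are shorthand labels chosen for this example rather than official categories, AfriSpeech-200 is entered as one language because it covers accented English, and the two ASR study rows are left out because they are reviews rather than reusable benchmarks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    name: str
    n_languages: int
    tasks: tuple  # shorthand tags, assumed for this sketch


BENCHMARKS = [
    Benchmark("AfroBench", 64, ("nlu", "generation", "qa", "math")),
    Benchmark("IrokoBench", 17, ("nli", "math", "knowledge")),
    Benchmark("AfriSenti", 14, ("sentiment",)),
    Benchmark("AfriHate", 15, ("hate-speech",)),
    Benchmark("AfriQA", 10, ("qa",)),
    Benchmark("AfriSpeech-200", 1, ("asr",)),
    Benchmark("AfriMTEB", 59, ("embedding",)),
]


def find(task: str, min_languages: int = 1) -> list:
    """Return benchmarks tagged with `task` that cover at least `min_languages`."""
    return [
        b for b in BENCHMARKS
        if task in b.tasks and b.n_languages >= min_languages
    ]


print([b.name for b in find("qa")])
print([b.name for b in find("legal")])
```

An empty result, as for the legal query here, only means nothing in this table matches; it is a cue to check Lanfrica and the proceedings before concluding that a gap exists.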

🚀 The 2026 additions worth foregrounding

Four recent benchmark and dataset releases (one from January 2026, the others from 2025) are particularly worth knowing about, because they expand the stack into areas it did not previously cover.

Afri-MCQA (Multimodal Cultural QA, arXiv:2601.05699, January 2026). The freshest African QA/reasoning benchmark, and the first to take cultural context seriously as an evaluation axis. If you are building anything that involves cultural reasoning on African content, start here.
NaijaVoices (arXiv:2505.20564, AfricaNLP 2025). Roughly 1,800 hours of multi-speaker Hausa, Igbo, and Yoruba speech (precisely 1,838.54 hours across 5,455 unique speakers and 645,000 unique sentences). It is one of the largest African ASR datasets to date, built under Lanfrica's “data farming” ethos.
Yankari (AfricaNLP 2025). A 30-million-token monolingual Yoruba corpus.
AfroCS-xs. The first dedicated code-switched agricultural dataset, covering Afrikaans, Sesotho, Yoruba, isiZulu, and English. Code-switching is the dominant register of African urban speech; benchmarks that take it seriously are overdue.

⚖️ Two Strategic Positions

There is a real and currently unresolved disagreement inside the African AI community about what the right strategic move is. The argument is worth understanding clearly because it shapes which kinds of project a postgraduate researcher might join.

The sovereign / frontier-aspirant position

The argument: African AI sovereignty requires African frontier-scale models, trained on African data, owned by African institutions, hosted on African compute. Without that, the continent will remain technologically dependent regardless of how much application-layer work happens locally. The clearest current expression is Nigeria's N-ATLAS, a national open-source LLM built by Awarri Technologies (Silas Adekunle and Eniola Edun) in partnership with the Nigerian Federal Ministry of Communications, Innovation and Digital Economy and published by NCAIR. It was funded with $3.5M in seed money from UNDP, UNESCO, Meta, Google, and Microsoft, and launched by Minister Bosun Tijani on the sidelines of UNGA80 on 25 September 2025. It is now publicly available on Hugging Face as NCAIR1/N-ATLaS: 8B parameters, Llama-3-8B base, fine-tuned across English, Hausa, Igbo, and Yoruba. The release is subject to a 1,000-user licence cap, with commercial use requiring explicit licensing from Awarri and the Federal Ministry. Cassava's compute work (11.4) and the Tanzania–Almawave Kiswahili partnership are adjacent expressions of the same strategic position.

Status: operational. Shipped September 2025, weights public on Hugging Face under a capped non-commercial-by-default licence.

The resource-efficient / pragmatic position

The argument made most explicitly by Pelonomi Moiloa at Lelapa AI: African sovereign AI capacity is best pursued by building smaller, more efficient models that the available infrastructure can actually support and that solve the problems African users actually have. The InkubaLM family, MzansiLM's 125M parameters covering all eleven SA languages, the Masakhane lineage's benchmark and dataset work, and Jacaranda's vertical-health deployment of UlizaLlama all sit in this strand. The argument is that engineering for the real constraints — compute, energy, end-user devices — is itself a sovereignty practice.

Status: shipping, used by real African users, smaller in headline numbers but more operationally honest.

The pedagogical point of putting these two positions side by side is not to declare a winner. Both are sovereignty practices in different registers; both have honest arguments behind them; the right answer for a particular project depends on what the project is for. The useful exercise is to be able to read a piece of African-AI work and identify clearly which strand it sits in, because the success criteria for the two are quite different. A 125M-parameter community-licensed model that runs on a feature-phone-class device is a success in the resource-efficient register; it is a partial step in the frontier-aspirant register. A nation-state frontier-scale model that is announced but not yet running is a partial step in the sovereign register; it is mostly absent from the resource-efficient one.
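
The difference in register shows up directly in hardware terms. A back-of-envelope sketch, using the nominal parameter counts quoted in this sub-lesson and the usual rule that weight memory is roughly parameters times bytes per parameter, is given below. It is an estimate of weight storage only: it ignores activations, the KV cache, and runtime overhead, and the helper name is hypothetical.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

# Nominal parameter counts as quoted in this sub-lesson.
MODELS = {
    "MzansiLM": 125e6,
    "InkubaLM": 422e6,
    "N-ATLaS": 8e9,
}


def weight_gb(n_params: float, precision: str = "fp16") -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


for name, n_params in MODELS.items():
    print(f"{name}: {weight_gb(n_params):.2f} GB at fp16, "
          f"{weight_gb(n_params, 'int4'):.2f} GB at int4")
```

At fp16 the estimate is about 0.25 GB for MzansiLM, about 0.84 GB for InkubaLM, and about 16 GB for an 8B-parameter model. That gap of well over an order of magnitude is what separates a model that fits on modest hardware from one that needs a dedicated accelerator.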

🔍 Where the Gaps Are

The single most useful thing this sub-lesson can do for the postgraduate audience is identify, concretely, what is missing in the current African data-and-models landscape. Six gaps are particularly thesis-shaped: each one is a genuine research opening that could be pursued from a UCT or comparable position.

No African-language LegalBench equivalent. Legal NLP benchmarks exist for English; the equivalent for any African language does not. Africa has substantial bodies of customary law, statutory law, and judicial decisions in many of its languages. A LegalBench-for-Swahili or LegalBench-for-isiXhosa would be a benchmark contribution and a step toward AI-supported access to law for people who do not work in English.
No African clinical QA benchmark grounded in local guidelines. MedQA exists; an African equivalent that takes local clinical guidelines, local epidemiology, and local pharmacological reality seriously does not. This is the kind of benchmark that would let African researchers evaluate foreign frontier models against local clinical reality before deploying them in healthcare settings.
No PRISMA-style reproducibility audit of African NLP. Imam et al.'s 2025 ASR review is the closest equivalent, and found <15% of African ASR studies reproducible. A wider study covering the full African NLP literature would surface the actual size of the reproducibility gap and would, by itself, be a publishable contribution.
No published Africa-specific critique-and-adaptation of CARE. A paper arguing what CARE would need to look like as a genuinely African framework rather than a wholesale Indigenous import is missing. The constructive work (Esethu) is there; the critical-theoretical companion is not.
No continent-level inventory of endangered-language ML resources. N|uu (Khoisan), Khwedam, and several other Cushitic, Nilotic, and Khoisan languages are at risk, yet no inventory exists of which of them have any ML data at all. SADiLaR's University of Pretoria node and the DSFSI lab, also at the University of Pretoria, hold parts of the picture; a unified continental inventory would let endangered-language documentation and AI work coordinate more effectively.
Operational uptake of community licensing remains low. Despite Esethu, despite Kaitiakitanga, despite CARE, most African datasets in 2024 and 2025 still ship under CC-BY or research-only licences. There is room for both empirical research on why, and for practical work helping projects move to community-grounded licences when it is appropriate.

If you are a postgraduate looking for a thesis topic in this part of the AI landscape, any of these gaps is a real opening — not in the sense that they are easy, but in the sense that the field will recognise the contribution.

🎯 What This Means for Your Research

Use the benchmarks. The African benchmark stack is real and broad enough to evaluate work in most of the major NLU and ASR areas. If your project involves African-language NLP and you are not evaluating against at least one of AfroBench / IrokoBench / AfriMTEB / Afri-MCQA / AfriHate / AfriSenti / AfriSpeech-200, you are working blind.
Use the models that exist. For inference and adaptation work on the languages they cover, InkubaLM, MzansiLM, AfroXLMR, Serengeti, and the named-language models are usable starting points. N-ATLaS is also downloadable from Hugging Face, but its 1,000-user cap and the separate licensing required for commercial use mean you should read the licence terms before building on it.
Consider community licensing when releasing data. If you collect or curate a dataset, especially one with significant cultural or linguistic content, look at Esethu and Kaitiakitanga before defaulting to CC-BY. The community-licensing track is genuinely more aligned with the African sovereignty conversation; defaulting away from it without considering it is a missed opportunity.
Pick the gaps deliberately. The six gaps above are not the only ones, but they are concrete and verifiable. If you are starting a project, choosing one of them by intention is more likely to produce a publishable contribution than choosing a topic only because foundation-model work is currently fashionable.
Resource-efficient is a sovereignty practice. The Lelapa argument generalises: matching your work to the compute and data you actually have access to, rather than to the headline numbers of foreign labs, is the operationally honest version of the sovereignty position covered in 11.4.

✏️ A Short Exercise

Pick a language and a domain relevant to your current or planned research project. The domain might be medicine, agriculture, law, climate science, education, the humanities, or another.
Check what benchmarks and datasets exist for that language × domain combination. Use the table above; check Lanfrica; check the AfricaNLP 2025 proceedings.
Identify what is missing. Is there a benchmark? Is there a dataset of the right size and quality? Is there a model that handles the language?
Write a one-paragraph proposal for what you would build first if you were going to fill that gap. Be honest about scope — would this be a year of work, a thesis, a community collaboration?
Bring it to class. We will pool the proposals across the cohort and look at which gaps the group could plausibly address together.

📚 Sources & Further Reading

📄 Sovereignty & data governance

Carroll, S. R. et al. (2020). The CARE Principles for Indigenous Data Governance. Data Science Journal 19(1): 43. DOI 10.5334/dsj-2020-043.

Jones, K. & Mahelona, K. (2022/2023). Data Sovereignty and the Kaitiakitanga License. Te Hiku Media. tehiku.nz.

Rajab, J. et al. (2025). The Esethu Framework. arXiv:2502.15916; ACL 2025.

African Union (2022). AU Data Policy Framework. au.int.

African Union (2024). Continental Artificial Intelligence Strategy. au.int.

First Nations Information Governance Centre. OCAP Principles. fnigc.ca.

📄 Models & community infrastructure

Alabi, J. O., Hedderich, M. A., Adelani, D. I. & Klakow, D. (2025). Charting the Landscape of African NLP: Mapping Progress and Shaping the Road Ahead. EMNLP 2025 main, pp. 27807–27841. arXiv:2505.21315.

Belay, T. D., Azime, I. A., Adelani, D. I. et al. (2025). The Rise of AfricaNLP: A Survey of Contributions, Contributors, Community Impact, and Bibliometric Analysis. arXiv:2509.25477.

Hussen, K. Y., Sewunetie, W. T., Ayele, A. A., Imam, S. H., Muhammad, S. H. & Yimam, S. M. (2025). The State of Large Language Models for African Languages: Progress and Challenges. arXiv:2506.02280.

Tonja, A. L. et al. (2024). InkubaLM: A small language model for low-resource African languages. arXiv:2408.17024. Model: huggingface.co/lelapa/InkubaLM-0.4B.

Lombard, A. et al. (2026). MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages. arXiv:2603.20732; LREC 2026.

Buzaaba, H., Wettig, A., Adelani, D. I. & Fellbaum, C. (2025). Lugha-Llama. arXiv:2504.06536.

Elmadany, A., Adebara, I. & Abdul-Mageed, M. (2024). Toucan. arXiv:2407.04796.

Adebara, I., Elmadany, A., Abdul-Mageed, M. & Inciarte, A. A. (2023). SERENGETI. arXiv:2212.10785; Findings of ACL 2023.

AfricaNLP 2025 proceedings. Lignos, C., Abdulmumin, I. & Adelani, D. I. (eds.). aclanthology.org.

Masakhane. masakhane.io; huggingface.co/masakhane.

Lanfrica. lanfrica.com.

Deep Learning Indaba 2025 (Kigali). deeplearningindaba.com/2025.

📄 Benchmarks

AfroBench — Ojo et al., ACL 2025 Findings. arXiv:2311.07978.

IrokoBench — Adelani et al., NAACL 2025 main. arXiv:2406.03368.

AfriSenti — Muhammad et al., EMNLP 2023. arXiv:2302.08956.

AfriHate — Muhammad et al., NAACL 2025 long. arXiv:2501.08284.

AfriQA — Ogundepo, Gwadabe et al., EMNLP 2023 Findings. arXiv:2305.06897.

AfriSpeech-200 — Olatunji et al., TACL 2023.

African ASR systematic review — Imam et al. (October 2025). arXiv:2510.01145.

African ASR benchmarking — Nahabwe et al. (November 2025). arXiv:2512.10968.

AfriMTEB & AfriE5 — Uemura, Zhang & Adelani, EACL 2026. arXiv:2510.23896.

Afri-MCQA — Multimodal Cultural QA, January 2026. arXiv:2601.05699.

NaijaVoices, Yankari, AfroCS-xs — AfricaNLP 2025 proceedings (see ACL Anthology link above).

Coming up in 11.6: we turn from data and models to the remaining two layers of the stack — policy and talent. The African Union Continental AI Strategy in its operational detail, the national strategies of South Africa / Kenya / Rwanda / Nigeria / Egypt (and the peer-reviewed comparative analysis in Yilma & Wodajo's Science and Public Policy special section), and the institutional landscape of Deep Learning Indaba, AIMS, ARIN, and Lelapa as the talent pipeline. We close 11.6 with the question that opens Week 12 (the integrative capstone): given everything we now know about African AI capacity, what is the most useful thing a postgraduate in this room can actually do?